Abstract

This paper addresses the general problem of do- 
main adaptation which arises in a variety of appli- 
cations where the distribution of the labeled sam- 
ple available somewhat differs from that of the test 
data. Building on previous work by Ben-David 
et al. (2007), we introduce a novel distance be- 
tween distributions, discrepancy distance, that is 
tailored to adaptation problems with arbitrary loss 
functions. We give Rademacher complexity bounds 
for estimating the discrepancy distance from finite 
samples for different loss functions. Using this dis- 
tance, we derive novel generalization bounds for 
domain adaptation for a wide family of loss func- 
tions. We also present a series of novel adaptation 
bounds for large classes of regularization-based al- 
gorithms, including support vector machines and 
kernel ridge regression based on the empirical dis- 
crepancy. This motivates our analysis of the prob- 
lem of minimizing the empirical discrepancy for 
various loss functions for which we also give novel 
algorithms. We report the results of preliminary 
experiments that demonstrate the benefits of our 
discrepancy minimization algorithms for domain 
adaptation. 



1 Introduction 

In the standard PAC model (Valiant, 1984) and other theoretical
models of learning, training and test instances are
assumed to be drawn from the same distribution. This is a 
natural assumption since, when the training and test distri- 
butions substantially differ, there can be no hope for gen- 
eralization. However, in practice, there are several crucial 
scenarios where the two distributions are more similar and 
learning can be more effective. One such scenario is that of 
domain adaptation, the main topic of our analysis. 

The problem of domain adaptation arises in a variety of 
applications in natural language processing (Dredze et al., 
2007; Blitzer et al., 2007; Jiang & Zhai, 2007; Chelba &
Acero, 2006; Daume III & Marcu, 2006), speech processing
(Leggetter & Woodland, 1995; Gauvain & Chin-Hui, 1994;
Pietra et al., 1992; Rosenfeld, 1996; Jelinek, 1998; Roark
& Bacchiani, 2003), computer vision (Martínez, 2002), and
many other areas. Quite often, little or no labeled data is
available from the target domain, but labeled data from a 
source domain somewhat similar to the target as well as large 
amounts of unlabeled data from the target domain are at one's 
disposal. The domain adaptation problem then consists of 
leveraging the source labeled and target unlabeled data to 
derive a hypothesis performing well on the target domain. 

A number of different adaptation techniques have been 
introduced in the past by the publications just mentioned 
and other similar work in the context of specific applica- 
tions. For example, a standard technique used in statistical 
language modeling and other generative models for part-of- 
speech tagging or parsing is based on the maximum a pos- 
teriori adaptation which uses the source data as prior knowl- 
edge to estimate the model parameters (Roark & Bacchiani, 
2003). Similar techniques and other more refined ones have 
been used for training maximum entropy models for lan- 
guage modeling or conditional models (Pietra et al., 1992; 
Jelinek, 1998; Chelba & Acero, 2006; Daume III & Marcu, 
2006). 

The first theoretical analysis of the domain adaptation 
problem was presented by Ben-David et al. (2007), who 
gave VC-dimension-based generalization bounds for adap- 
tation in classification tasks. Perhaps, the most significant 
contribution of this work was the definition and application 
of a distance between distributions, the $d_A$ distance, that is
particularly relevant to the problem of domain adaptation and 
that can be estimated from finite samples for a finite VC di- 
mension, as previously shown by Kifer et al. (2004). This 
work was later extended by Blitzer et al. (2008) who also 
gave a bound on the error rate of a hypothesis derived from a 
weighted combination of the source data sets for the specific 
case of empirical risk minimization. A theoretical study of 
domain adaptation was presented by Mansour et al. (2009), 
where the analysis deals with the related but distinct case of 
adaptation with multiple sources, and where the target is a 
mixture of the source distributions. 

This paper presents a novel theoretical and algorithmic 
analysis of the problem of domain adaptation. It builds on 
the work of Ben-David et al. (2007) and extends it in sev- 
eral ways. We introduce a novel distance, the discrepancy 
distance, that is tailored to comparing distributions in adaptation.
This distance coincides with the $d_A$ distance for 0-1
classification, but it can be used to compare distributions for 
more general tasks, including regression, and with other loss 
functions. As already pointed out, a crucial advantage of the
$d_A$ distance is that it can be estimated from finite samples
when the set of regions used has finite VC-dimension. We 
prove that the same holds for the discrepancy distance and 
in fact give data-dependent versions of that statement with 
sharper bounds based on the Rademacher complexity. 

We give new generalization bounds for domain adapta- 
tion and point out some of their benefits by comparing them 
with previous bounds. We further combine these with the 
properties of the discrepancy distance to derive data-dependent 
Rademacher complexity learning bounds. We also present 
a series of novel results for large classes of regularization-based
algorithms, including support vector machines (SVMs)
(Cortes & Vapnik, 1995) and kernel ridge regression (KRR) 
(Saunders et al., 1998). We compare the pointwise loss of 
the hypothesis returned by these algorithms when trained on 
a sample drawn from the target domain distribution, versus 
that of a hypothesis selected by these algorithms when train- 
ing on a sample drawn from the source distribution. We show 
that the difference of these pointwise losses can be bounded 
by a term that depends directly on the empirical discrepancy 
distance of the source and target distributions. 

These learning bounds motivate the idea of replacing the 
empirical source distribution with another distribution with 
the same support but with the smallest discrepancy with re- 
spect to the target empirical distribution, which can be viewed 
as reweighting the loss on each labeled point. We analyze 
the problem of determining the distribution minimizing the 
discrepancy in both 0-1 classification and square loss regres- 
sion. We show how the problem can be cast as a linear pro- 
gram (LP) for the 0-1 loss and derive a specific efficient com- 
binatorial algorithm to solve it in dimension one. We also 
give a polynomial-time algorithm for solving this problem 
in the case of the square loss by proving that it can be cast 
as a semi-definite program (SDP). Finally, we report the re- 
sults of preliminary experiments showing the benefits of our 
analysis and discrepancy minimization algorithms. 

In section 2, we describe the learning set-up for domain 
adaptation and introduce the notation and Rademacher com- 
plexity concepts needed for the presentation of our results. 
Section 3 introduces the discrepancy distance and analyzes 
its properties. Section 4 presents our generalization bounds 
and our theoretical guarantees for regularization-based algo- 
rithms. Section 5 describes and analyzes our discrepancy 
minimization algorithms. Section 6 reports the results of our 
preliminary experiments. 

2 Preliminaries 
2.1 Learning Set-Up 

We consider the familiar supervised learning setting where 
the learning algorithm receives a sample of m labeled points 
$S = (z_1, \ldots, z_m) = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times Y)^m$,
where $X$ is the input space and $Y$ the label set, which is
$\{0, 1\}$ in classification and some measurable subset of $\mathbb{R}$ in
regression. 

In the domain adaptation problem, the training sample 
S is drawn according to a source distribution Q, while test 
points are drawn according to a target distribution P that 
may somewhat differ from Q. We denote by $f \colon X \to Y$ the
target labeling function. We shall also discuss cases where
the source labeling function $f_Q$ differs from the target domain
labeling function $f_P$. Clearly, this dissimilarity will
need to be small for adaptation to be possible.

We will assume that the learner is provided with an unlabeled
sample $T$ drawn i.i.d. according to the target distribution
$P$. We denote by $L \colon Y \times Y \to \mathbb{R}$ a loss function defined
over pairs of labels and by $\mathcal{L}_Q(f, g)$ the expected loss for any
two functions $f, g \colon X \to Y$ and any distribution $Q$ over $X$:

$$\mathcal{L}_Q(f, g) = \operatorname*{E}_{x \sim Q}\big[L(f(x), g(x))\big].$$

The domain adaptation problem consists of selecting a
hypothesis $h$ out of a hypothesis set $H$ with a small expected
loss according to the target distribution $P$, $\mathcal{L}_P(h, f)$.

2.2 Rademacher Complexity 

Our generalization bounds will be based on the following 
data-dependent measure of the complexity of a class of func- 
tions. 

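Definition 1 (Rademacher Complexity) Let $H$ be a set of
real-valued functions defined over a set $X$. Given a sample
$S \in X^m$, the empirical Rademacher complexity of $H$ is
defined as follows:

$$\hat{\mathfrak{R}}_S(H) = \operatorname*{E}_{\sigma}\Big[\sup_{h \in H} \Big|\frac{2}{m} \sum_{i=1}^{m} \sigma_i h(x_i)\Big| \;\Big|\; S = (x_1, \ldots, x_m)\Big]. \quad (1)$$

The expectation is taken over $\sigma = (\sigma_1, \ldots, \sigma_m)$, where the $\sigma_i$
are independent uniform random variables taking values in
$\{-1, +1\}$. The Rademacher complexity of a hypothesis set
$H$ is defined as the expectation of $\hat{\mathfrak{R}}_S(H)$ over all samples
of size $m$:

$$\mathfrak{R}_m(H) = \operatorname*{E}_{S}\big[\hat{\mathfrak{R}}_S(H) \,\big|\, |S| = m\big]. \quad (2)$$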

The Rademacher complexity measures the ability of a class 
of functions to fit noise. The empirical Rademacher com- 
plexity has the added advantage that it is data-dependent and 
can be measured from finite samples. It can lead to tighter 
bounds than those based on other measures of complexity 
such as the VC-dimension (Koltchinskii & Panchenko, 2000). 
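As a rough illustration of "fitting noise", the sketch below computes the empirical Rademacher complexity exactly for a finite class on a very small sample by enumerating all sign vectors. It assumes the convention with the factor $2/m$ and an absolute value inside the supremum; the sample, the two toy classes, and the function name `empirical_rademacher` are illustrative choices, not part of the paper.

```python
from itertools import product


def empirical_rademacher(hypotheses, sample):
    """Exact empirical Rademacher complexity of a finite class on a small sample.

    Assumed convention: E_sigma[ sup_h | (2/m) * sum_i sigma_i * h(x_i) | ].
    All 2^m sign vectors are enumerated, so this is only practical for small m.
    """
    m = len(sample)
    values = [[h(x) for x in sample] for h in hypotheses]
    total = 0.0
    for sigma in product((-1, 1), repeat=m):
        # Supremum over the class of the absolute correlation with the signs.
        total += max(abs(sum(s * v for s, v in zip(sigma, row))) for row in values)
    return 2.0 * total / (m * 2 ** m)


sample = [0.5, 1.5, 2.5]
constant_class = [lambda x: 1.0]
# Hypothetical class of 1-D threshold functions; it contains the constant function.
threshold_class = [lambda x, t=t: 1.0 if x >= t else 0.0 for t in (0.0, 1.0, 2.0, 3.0)]

r_constant = empirical_rademacher(constant_class, sample)
r_threshold = empirical_rademacher(threshold_class, sample)
```

On this toy sample the richer threshold class has the larger value, matching the intuition that a larger class correlates better with random signs.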

We will denote by $\hat{R}_S(h)$ the empirical average of a hypothesis
$h \colon X \to \mathbb{R}$ and by $R(h)$ its expectation over a
sample $S$ drawn according to the distribution considered.
The following is a version of the Rademacher complexity 
bounds by Koltchinskii and Panchenko (2000) and Bartlett 
and Mendelson (2002). For completeness, the full proof is 
given in the Appendix. 

Theorem 2 (Rademacher Bound) Let $H$ be a class of functions
mapping $Z = X \times Y$ to $[0, 1]$ and $S = (z_1, \ldots, z_m)$
a finite sample drawn i.i.d. according to a distribution $Q$.
Then, for any $\delta > 0$, with probability at least $1 - \delta$ over
samples $S$ of size $m$, the following inequality holds for all
$h \in H$:

$$R(h) \le \hat{R}_S(h) + \hat{\mathfrak{R}}_S(H) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}}. \quad (3)$$



3 Distances between Distributions 

Clearly, for generalization to be possible, the distributions Q
and P must not be too dissimilar; thus some measure of the
similarity of these distributions will be critical in the deriva- 
tion of our generalization bounds or the design of our algo- 
rithms. This section discusses this question and introduces a 
discrepancy distance relevant to the context of adaptation. 

The $\ell_1$ distance yields a straightforward bound on the difference
of the error of a hypothesis $h$ with respect to $Q$ versus
its error with respect to P. 

Proposition 1 Assume that the loss $L$ is bounded, $L \le M$
for some $M > 0$. Then, for any hypothesis $h \in H$,

$$|\mathcal{L}_Q(h, f) - \mathcal{L}_P(h, f)| \le M\,\ell_1(Q, P). \quad (4)$$



This provides us with a first adaptation bound suggesting 
that for small values of the $\ell_1$ distance between the source
and target distributions, the average loss of hypothesis $h$ tested
on the target domain is close to its average loss on the source
domain. However, in general, this bound is not informative
since the $\ell_1$ distance can be large even in favorable adaptation
situations. Instead, one can use a distance between distribu- 
tions better suited to the learning task. 

Consider for example the case of classification with the
0-1 loss. Fix $h \in H$, and let $a$ denote the support of $|h - f|$.
Observe that $|\mathcal{L}_Q(h, f) - \mathcal{L}_P(h, f)| = |Q(a) - P(a)|$. A
natural distance between distributions in this context is thus
one based on the supremum of the right-hand side over all
regions $a$. Since the target hypothesis $f$ is not known, the
region $a$ should be taken as the support of $|h - h'|$ for any
two $h, h' \in H$.

This leads us to the following definition of a distance 
originally introduced by Devroye et al. (1996) [pp. 271- 
272] under the name of generalized Kolmogorov-Smirnov 
distance, later by Kifer et al. (2004) as the $d_A$ distance,
and introduced and applied to the analysis of adaptation in 
classification by Ben-David et al. (2007) and Blitzer et al. 
(2008). 

Definition 3 ($d_A$-Distance) Let $A \subseteq 2^{X}$ be a set of subsets
of $X$. Then, the $d_A$-distance between two distributions $Q_1$
and $Q_2$ over $X$ is defined as

$$d_A(Q_1, Q_2) = \sup_{a \in A} |Q_1(a) - Q_2(a)|. \quad (5)$$

As just discussed, in 0-1 classification, a natural choice
for $A$ is $A = H \Delta H = \{|h' - h| \colon h, h' \in H\}$. We introduce
a distance between distributions, discrepancy distance, that 
can be used to compare distributions for more general tasks, 
e.g., regression. Our choice of the terminology is partly mo- 
tivated by the relationship of this notion with the discrepancy 
problems arising in combinatorial contexts (Chazelle, 2000). 

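Definition 4 (Discrepancy Distance) Let $H$ be a set of functions
mapping $X$ to $Y$ and let $L \colon Y \times Y \to \mathbb{R}_+$ define a loss
function over $Y$. The discrepancy distance $\mathrm{disc}_L$ between
two distributions $Q_1$ and $Q_2$ over $X$ is defined by

$$\mathrm{disc}_L(Q_1, Q_2) = \max_{h, h' \in H} \big|\mathcal{L}_{Q_1}(h', h) - \mathcal{L}_{Q_2}(h', h)\big|.$$
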

The discrepancy distance is clearly symmetric and it is not
hard to verify that it satisfies the triangle inequality, regardless
of the loss function used. In general, however, it does
not define a distance: we may have $\mathrm{disc}_L(Q_1, Q_2) = 0$ for
$Q_1 \ne Q_2$, even for non-trivial hypothesis sets such as that of
bounded linear functions and standard continuous loss func- 
tions. 

Note that for the 0-1 classification loss, the discrepancy 
distance coincides with the $d_A$ distance with $A = H \Delta H$.
But the discrepancy distance helps us compare distributions
for other losses such as $L_q(y, y') = |y - y'|^q$ for some $q$ and
is more general.
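For finite-support distributions and a finite hypothesis set, the discrepancy can be evaluated by direct enumeration of pairs of hypotheses. The following is a minimal sketch under those assumptions; the distributions `q1` and `q2`, the two toy hypothesis sets, and the function names are hypothetical and chosen only to show the 0-1 and squared-loss cases side by side.

```python
from itertools import product


def expected_loss(dist, loss, h1, h2):
    """L_Q(h1, h2) = sum_x Q(x) * loss(h1(x), h2(x)) for a finite-support Q."""
    return sum(p * loss(h1(x), h2(x)) for x, p in dist.items())


def discrepancy(hypotheses, loss, dist1, dist2):
    """max over pairs (h, h') of |L_{Q1}(h', h) - L_{Q2}(h', h)|."""
    return max(
        abs(expected_loss(dist1, loss, hp, h) - expected_loss(dist2, loss, hp, h))
        for h, hp in product(hypotheses, repeat=2)
    )


def zero_one(y, yp):
    return float(y != yp)


def squared(y, yp):
    return (y - yp) ** 2


# Two toy distributions over the points {0, 1}, given as point -> probability.
q1 = {0.0: 0.5, 1.0: 0.5}
q2 = {0.0: 0.25, 1.0: 0.75}

# Hypothetical finite hypothesis sets: thresholds for classification,
# and linear functions x -> w * x for regression.
thresholds = [lambda x, t=t: 1.0 if x >= t else 0.0 for t in (-0.5, 0.5, 1.5)]
linear = [lambda x, w=w: w * x for w in (-1.0, 0.0, 1.0)]

disc_01 = discrepancy(thresholds, zero_one, q1, q2)
disc_sq = discrepancy(linear, squared, q1, q2)
```

The same routine serves both losses; only the `loss` argument changes, which is the sense in which the discrepancy extends the 0-1 case.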

As shown by Kifer et al. (2004), an important advantage
of the $d_A$ distance is that it can be estimated from finite
samples when $A$ has finite VC-dimension. We prove that
the same holds for the $\mathrm{disc}_L$ distance and in fact give
data-dependent versions of that statement with sharper bounds
based on the Rademacher complexity. 

The following theorem shows that for a bounded loss
function $L$, the discrepancy distance $\mathrm{disc}_L$ between a distribution
and its empirical distribution can be bounded in terms
of the empirical Rademacher complexity of the class of functions
$L_H = \{x \mapsto L(h'(x), h(x)) \colon h, h' \in H\}$. In particular,
when $L_H$ has finite pseudo-dimension, this implies that
the discrepancy distance converges to zero as $O(\sqrt{\log m / m})$.

Proposition 2 Assume that the loss function $L$ is bounded
by $M > 0$. Let $Q$ be a distribution over $X$ and let $\hat{Q}$ denote
the corresponding empirical distribution for a sample
$S = (x_1, \ldots, x_m)$. Then, for any $\delta > 0$, with probability at least
$1 - \delta$ over samples $S$ of size $m$ drawn according to $Q$:

$$\mathrm{disc}_L(Q, \hat{Q}) \le \hat{\mathfrak{R}}_S(L_H) + 3M\sqrt{\frac{\log\frac{2}{\delta}}{2m}}. \quad (6)$$


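Proof: We scale the loss $L$ to $[0, 1]$ by dividing by $M$, and
denote the new class by $L_H/M$. By Theorem 2 applied to
$L_H/M$, for any $\delta > 0$, with probability at least $1 - \delta$, the
following inequality holds for all $h, h' \in H$:

$$\frac{\mathcal{L}_Q(h', h)}{M} \le \frac{\mathcal{L}_{\hat{Q}}(h', h)}{M} + \hat{\mathfrak{R}}_S(L_H/M) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$

The empirical Rademacher complexity has the property that
$\hat{\mathfrak{R}}(\alpha H) = \alpha\,\hat{\mathfrak{R}}(H)$ for any hypothesis class $H$ and positive
real number $\alpha$ (Bartlett & Mendelson, 2002). Thus,
$\hat{\mathfrak{R}}_S(L_H/M) = \frac{1}{M}\hat{\mathfrak{R}}_S(L_H)$, which proves the proposition.
■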

For the specific case of $L_q$ regression losses, the bound
can be made more explicit. 

Corollary 5 Let $H$ be a hypothesis set bounded by some
$M > 0$ for the loss function $L_q$: $L_q(h, h') \le M$, for all
$h, h' \in H$. Let $Q$ be a distribution over $X$ and let $\hat{Q}$ denote
the corresponding empirical distribution for a sample
$S = (x_1, \ldots, x_m)$. Then, for any $\delta > 0$, with probability at
least $1 - \delta$ over samples $S$ of size $m$ drawn according to $Q$:

$$\mathrm{disc}_{L_q}(Q, \hat{Q}) \le 4q\,\hat{\mathfrak{R}}_S(H) + 3M\sqrt{\frac{\log\frac{2}{\delta}}{2m}}. \quad (7)$$


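Proof: The function $f \colon x \mapsto x^q$ is $q$-Lipschitz for $x \in [0, 1]$:

$$|f(x') - f(x)| \le q\,|x' - x|, \quad (8)$$

and $f(0) = 0$. For $L = L_q$, $L_H = \{x \mapsto |h'(x) - h(x)|^q \colon h, h' \in H\}$.
Thus, by Talagrand's contraction lemma
(Ledoux & Talagrand, 1991), $\hat{\mathfrak{R}}(L_H)$ is bounded by $2q\,\hat{\mathfrak{R}}(H')$
with $H' = \{x \mapsto (h'(x) - h(x)) \colon h, h' \in H\}$. Then,
$\hat{\mathfrak{R}}_S(H')$ can be written and bounded as follows

$$\hat{\mathfrak{R}}_S(H') = \operatorname*{E}_{\sigma}\Big[\sup_{h, h'} \frac{2}{m}\Big|\sum_{i=1}^{m} \sigma_i\big(h(x_i) - h'(x_i)\big)\Big|\Big] \le \operatorname*{E}_{\sigma}\Big[\sup_{h} \frac{2}{m}\Big|\sum_{i=1}^{m} \sigma_i h(x_i)\Big|\Big] + \operatorname*{E}_{\sigma}\Big[\sup_{h'} \frac{2}{m}\Big|\sum_{i=1}^{m} \sigma_i h'(x_i)\Big|\Big],$$

using the definition of the Rademacher variables and the subadditivity
of the supremum function. This proves the inequality
$\hat{\mathfrak{R}}(L_H) \le 4q\,\hat{\mathfrak{R}}(H)$ and the corollary. ■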

A very similar proof gives the following result for classi- 
fication. 

Corollary 6 Let $H$ be a set of classifiers mapping $X$ to $\{0, 1\}$
and let $L_{01}$ denote the 0-1 loss. Then, with the notation of
Corollary 5, for any $\delta > 0$, with probability at least $1 - \delta$
over samples $S$ of size $m$ drawn according to $Q$:

$$\mathrm{disc}_{L_{01}}(Q, \hat{Q}) \le 4\,\hat{\mathfrak{R}}_S(H) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}}. \quad (9)$$

The factor of 4 can in fact be reduced to 2 in these corollar- 
ies when using a more favorable constant in the contraction 
lemma. The following corollary shows that the discrepancy 
distance can be estimated from finite samples. 

Corollary 7 Let $H$ be a hypothesis set bounded by some
$M > 0$ for the loss function $L_q$: $L_q(h, h') \le M$, for all
$h, h' \in H$. Let $Q$ be a distribution over $X$ and $\hat{Q}$ the corresponding
empirical distribution for a sample $S$, and let $P$
be a distribution over $X$ and $\hat{P}$ the corresponding empirical
distribution for a sample $T$. Then, for any $\delta > 0$, with
probability at least $1 - \delta$ over samples $S$ of size $m$ drawn
according to $Q$ and samples $T$ of size $n$ drawn according to
$P$:

$$\mathrm{disc}_{L_q}(P, Q) \le \mathrm{disc}_{L_q}(\hat{P}, \hat{Q}) + 4q\big(\hat{\mathfrak{R}}_S(H) + \hat{\mathfrak{R}}_T(H)\big) + 3M\Big(\sqrt{\frac{\log\frac{4}{\delta}}{2m}} + \sqrt{\frac{\log\frac{4}{\delta}}{2n}}\Big).$$

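Proof: By the triangle inequality, we can write

$$\mathrm{disc}_{L_q}(P, Q) \le \mathrm{disc}_{L_q}(P, \hat{P}) + \mathrm{disc}_{L_q}(\hat{P}, \hat{Q}) + \mathrm{disc}_{L_q}(\hat{Q}, Q). \quad (10)$$

The result then follows by the application of Corollary 5 to
$\mathrm{disc}_{L_q}(P, \hat{P})$ and $\mathrm{disc}_{L_q}(\hat{Q}, Q)$. ■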

As with Corollary 6, a similar result holds for the 0-1 loss 
in classification. 



4 Domain Adaptation: Generalization Bounds

This section presents generalization bounds for domain adap- 
tation given in terms of the discrepancy distance just defined. 
In the context of adaptation, two types of questions arise: 

(1) we may ask, as for standard generalization, how the
average loss of a hypothesis on the target distribution,
$\mathcal{L}_P(h, f)$, differs from $\mathcal{L}_{\hat{Q}}(h, f)$, its empirical error based
on the empirical distribution $\hat{Q}$;

(2) another natural question is, given a specific learning algorithm,
by how much does $\mathcal{L}_P(h_Q, f)$ deviate from
$\mathcal{L}_P(h_P, f)$, where $h_Q$ is the hypothesis returned by the
algorithm when trained on a sample drawn from $Q$ and
$h_P$ the one it would have returned by training on a sample
drawn from the true target distribution $P$.

We will present theoretical guarantees addressing both ques- 
tions. 

4.1 Generalization bounds 

Let $h_Q^* \in \operatorname{argmin}_{h \in H} \mathcal{L}_Q(h, f_Q)$ and similarly let $h_P^*$ be a
minimizer of $\mathcal{L}_P(h, f_P)$. Note that these minimizers may
not be unique. For adaptation to succeed, it is natural to
assume that the average loss $\mathcal{L}_P(h_Q^*, h_P^*)$ between the
best-in-class hypotheses is small. Under that assumption and for a
small discrepancy distance, the following theorem provides 
a useful bound on the error of a hypothesis with respect to 
the target domain. 

Theorem 8 Assume that the loss function $L$ is symmetric
and obeys the triangle inequality. Then, for any hypothesis
$h \in H$, the following holds

$$\mathcal{L}_P(h, f_P) \le \mathcal{L}_P(h_P^*, f_P) + \mathcal{L}_Q(h, h_Q^*) + \mathrm{disc}(P, Q) + \mathcal{L}_P(h_Q^*, h_P^*). \quad (11)$$

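Proof: Fix $h \in H$. By the triangle inequality property of
$L$ and the definition of the discrepancy $\mathrm{disc}_L(P, Q)$, the following
holds

$$\begin{aligned}
\mathcal{L}_P(h, f_P) &\le \mathcal{L}_P(h, h_Q^*) + \mathcal{L}_P(h_Q^*, h_P^*) + \mathcal{L}_P(h_P^*, f_P) \\
&\le \mathcal{L}_Q(h, h_Q^*) + \mathrm{disc}_L(P, Q) + \mathcal{L}_P(h_Q^*, h_P^*) + \mathcal{L}_P(h_P^*, f_P). \;■
\end{aligned}$$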

We compare (11) with the main adaptation bound given by
Ben-David et al. (2007) and Blitzer et al. (2008):

$$\mathcal{L}_P(h, f_P) \le \mathcal{L}_Q(h, f_Q) + \mathrm{disc}_L(P, Q) + \min_{h \in H}\big(\mathcal{L}_Q(h, f_Q) + \mathcal{L}_P(h, f_P)\big). \quad (12)$$

It is very instructive to compare the two bounds. Intuitively,
the bound of Theorem 8 has only one error term that involves
the target function, while the bound of (12) has three terms
involving the target function. One extreme case is when there
is a single hypothesis $h$ in $H$ and a single target function
$f$. In this case, Theorem 8 gives a bound of
$\mathcal{L}_P(h, f) + \mathrm{disc}(P, Q)$, while the bound supplied by (12) is
$2\mathcal{L}_Q(h, f) + \mathcal{L}_P(h, f) + \mathrm{disc}(P, Q)$, which is at least
$3\mathcal{L}_P(h, f) + \mathrm{disc}(P, Q)$ when $\mathcal{L}_Q(h, f) \ge \mathcal{L}_P(h, f)$. One can even see
that the bound of (12) might become vacuous for moderate
values of $\mathcal{L}_Q(h, f)$ and $\mathcal{L}_P(h, f)$. While this is clearly an
extreme case, an error with a factor of 3 can arise in more
realistic situations, especially when the distance between the
target function and the hypothesis class is significant.
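The single-hypothesis comparison above reduces to simple arithmetic, which the following sketch makes concrete. The two expressions are the ones stated in the text for this extreme case; the numerical values of the losses and of the discrepancy are made up for illustration.

```python
def bound_theorem8_single(loss_p, disc):
    """Theorem 8 with a single hypothesis and a single target: L_P(h, f) + disc(P, Q)."""
    return loss_p + disc


def bound_eq12_single(loss_q, loss_p, disc):
    """Bound (12) in the same case: 2 * L_Q(h, f) + L_P(h, f) + disc(P, Q)."""
    return 2.0 * loss_q + loss_p + disc


# Hypothetical values: a small target loss and a somewhat larger source loss.
t8_small = bound_theorem8_single(loss_p=0.1, disc=0.05)
e12_small = bound_eq12_single(loss_q=0.2, loss_p=0.1, disc=0.05)

# Hypothetical moderate values for which (12) reaches 1, which is vacuous
# for a loss bounded by 1, while the Theorem 8 expression stays informative.
t8_moderate = bound_theorem8_single(loss_p=0.3, disc=0.1)
e12_moderate = bound_eq12_single(loss_q=0.3, loss_p=0.3, disc=0.1)
```

With equal source and target losses the second expression exceeds the first by exactly twice the loss, which is where the factor of 3 on the error term comes from.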

While in general the two bounds are incomparable, it 
is worthwhile to compare them using some relatively plausible
assumptions. Assume that the discrepancy distance
between $P$ and $Q$ is small and so is the average loss between
$h_Q^*$ and $h_P^*$. These are natural assumptions for adaptation
to be possible. Then, Theorem 8 indicates that the
regret $\mathcal{L}_P(h, f_P) - \mathcal{L}_P(h_P^*, f_P)$ is essentially bounded by
$\mathcal{L}_Q(h, h_Q^*)$, the average loss with respect to $h_Q^*$ on $Q$. We
now consider several special cases of interest.

(i) When $h_Q^* = h_P^*$, then $h^* = h_Q^* = h_P^*$ and the bound of
Theorem 8 becomes

$$\mathcal{L}_P(h, f_P) \le \mathcal{L}_P(h^*, f_P) + \mathcal{L}_Q(h, h^*) + \mathrm{disc}(P, Q). \quad (13)$$

The bound of (12) becomes

$$\mathcal{L}_P(h, f_P) \le \mathcal{L}_P(h^*, f_P) + \mathcal{L}_Q(h, f_Q) + \mathcal{L}_Q(h^*, f_Q) + \mathrm{disc}(P, Q),$$

where the right-hand side essentially includes the sum
of 3 errors and is always larger than the right-hand side
of (13) since by the triangle inequality
$\mathcal{L}_Q(h, h^*) \le \mathcal{L}_Q(h, f_Q) + \mathcal{L}_Q(h^*, f_Q)$.

(ii) When $h_Q^* = h_P^* = h^*$ and $\mathrm{disc}(P, Q) = 0$, the bound of
Theorem 8 becomes

$$\mathcal{L}_P(h, f_P) \le \mathcal{L}_P(h^*, f_P) + \mathcal{L}_Q(h, h^*),$$

which coincides with the standard generalization bound.
The bound of (12) does not coincide with the standard
bound and leads to:

$$\mathcal{L}_P(h, f_P) \le \mathcal{L}_P(h^*, f_P) + \mathcal{L}_Q(h, f_Q) + \mathcal{L}_Q(h^*, f_Q).$$

(iii) When $f_P \in H$ (consistent case), the bound of (12) simplifies
to

$$|\mathcal{L}_P(h, f_P) - \mathcal{L}_Q(h, f_P)| \le \mathrm{disc}_L(Q, P),$$

and it can also be derived using the proof of Theorem 8.

Finally, Theorem 8 clearly leads to bounds based on the empirical
error of $h$ on a sample drawn according to $Q$. We
give the bound related to the 0-1 loss; others can be derived
in a similar way from Corollaries 5-7 and other similar corollaries.
The result follows from Theorem 8 combined with Corollary 7
and a standard Rademacher classification bound (Theorem 14)
(Bartlett & Mendelson, 2002).

Theorem 9 Let $H$ be a family of functions mapping $X$ to
$\{0, 1\}$ and let the rest of the assumptions be as in Corollary 7.
Then, for any hypothesis $h \in H$, with probability at
least $1 - \delta$, the following adaptation generalization bound
holds for the 0-1 loss:

$$\mathcal{L}_P(h, f_P) - \mathcal{L}_P(h_P^*, f_P) \le \mathcal{L}_{\hat{Q}}(h, h_Q^*) + \mathrm{disc}_{L_{01}}(\hat{P}, \hat{Q}) + (\ldots)\,\hat{\mathfrak{R}}_S(H) + 4q\,\hat{\mathfrak{R}}_T(H) + \ldots$$




Figure 1: In this example, the gray regions are assumed to 
have zero support in the target distribution P. Thus, there 
exist consistent hypotheses such as the linear separator dis- 
played. However, for the source distribution Q no linear sep- 
aration is possible. 

4.2 Guarantees for regularization-based algorithms 

In this section, we first assume that the hypothesis set $H$ includes
the target function $f_P$. Note that this does not imply
that $f_Q$ is in $H$. Even when $f_P$ and $f_Q$ are restrictions to
$\mathrm{supp}(P)$ and $\mathrm{supp}(Q)$ of the same labeling function $f$, we
may have $f_P \in H$ and $f_Q \notin H$ and the source problem
could be non-realizable. Figure 1 illustrates this situation.

For a fixed loss function $L$, we denote by $\hat{R}_{\hat{Q}}(h)$ the empirical
error of a hypothesis $h$ with respect to an empirical
distribution $\hat{Q}$: $\hat{R}_{\hat{Q}}(h) = \mathcal{L}_{\hat{Q}}(h, f)$. Let $N \colon H \to \mathbb{R}_+$ be
a function defined over the hypothesis set $H$. We will assume
that $H$ is a convex subset of a vector space and that
the loss function $L$ is convex with respect to each of its arguments.
Regularization-based algorithms minimize an objective
of the form

$$F_{\hat{Q}}(h) = \hat{R}_{\hat{Q}}(h) + \lambda N(h), \quad (15)$$

where $\lambda > 0$ is a trade-off parameter. This family of algorithms
includes support vector machines (SVM) (Cortes
& Vapnik, 1995), support vector regression (SVR) (Vapnik, 
1998), kernel ridge regression (Saunders et al., 1998), and 
other algorithms such as those based on the relative entropy 
regularization (Bousquet & Elisseeff, 2002). 
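To make the regularized objective concrete, here is a minimal sketch for the simplest possible instance: the square loss with hypotheses $h(x) = wx$ on the real line (a linear kernel, for which the squared norm of $h$ is $w^2$), and an empirical distribution given by explicit per-point weights. The data, the weights, and the function names are hypothetical; the closed form follows from setting the derivative of the objective in $w$ to zero.

```python
def weighted_ridge_1d(points, weights, lam):
    """Minimize sum_i weights[i] * (w * x_i - y_i)**2 + lam * w**2 over w.

    Setting the derivative to zero gives
    w = sum_i q_i x_i y_i / (sum_i q_i x_i^2 + lam).
    """
    num = sum(q * x * y for (x, y), q in zip(points, weights))
    den = sum(q * x * x for (x, _), q in zip(points, weights)) + lam
    return num / den


def objective(w, points, weights, lam):
    """Weighted empirical square loss plus the regularization term lam * w**2."""
    return sum(q * (w * x - y) ** 2 for (x, y), q in zip(points, weights)) + lam * w * w


# Hypothetical labeled sample (x_i, y_i) and two candidate weightings of it.
points = [(1.0, 1.0), (2.0, 2.5), (3.0, 2.0)]
uniform = [1.0 / 3.0] * 3          # the plain empirical distribution
reweighted = [0.1, 0.1, 0.8]       # an alternative distribution on the same support
lam = 0.1

w_uniform = weighted_ridge_1d(points, uniform, lam)
w_reweighted = weighted_ridge_1d(points, reweighted, lam)
```

Changing the weights while keeping the support fixed changes the returned hypothesis, which is the mechanism that reweighting of the training sample relies on.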

We denote by $B_F$ the Bregman divergence associated to
a convex function $F$,

$$B_F(f \| g) = F(f) - F(g) - \langle f - g, \nabla F(g) \rangle \quad (16)$$

and define $\Delta h$ as $\Delta h = h' - h$.
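As a worked instance of the Bregman divergence, consider the squared norm $N(\cdot) = \|\cdot\|_K^2$ of a Hilbert space, whose gradient at $g$ is $2g$. The short derivation below is a sketch under that assumption; it shows that the divergence reduces to a squared distance and is therefore symmetric in this case.

```latex
\begin{aligned}
B_N(f \,\|\, g)
  &= \|f\|_K^2 - \|g\|_K^2 - \langle f - g,\, 2g \rangle_K \\
  &= \|f\|_K^2 - 2\langle f, g \rangle_K + \|g\|_K^2 \\
  &= \|f - g\|_K^2 ,
\qquad\text{so}\qquad
B_N(h' \,\|\, h) + B_N(h \,\|\, h') = 2\,\|\Delta h\|_K^2 .
\end{aligned}
```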

Lemma 10 Let the hypothesis set $H$ be a vector space. Assume
that $N$ is a proper closed convex function and that $N$
and $L$ are differentiable. Assume that $F_{\hat{Q}}$ admits a minimizer
$h \in H$ and $F_{\hat{P}}$ a minimizer $h' \in H$ and that $f_P$ and $f_Q$ coincide
on the support of $\hat{Q}$. Then, the following bound holds,

$$B_N(h' \| h) + B_N(h \| h') \le \frac{2\,\mathrm{disc}_L(\hat{P}, \hat{Q})}{\lambda}. \quad (17)$$

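Proof: Since $B_{F_{\hat{Q}}} = B_{\hat{R}_{\hat{Q}}} + \lambda B_N$ and $B_{F_{\hat{P}}} = B_{\hat{R}_{\hat{P}}} + \lambda B_N$,
and a Bregman divergence is non-negative, the following inequality
holds:

$$\lambda\big(B_N(h' \| h) + B_N(h \| h')\big) \le B_{F_{\hat{Q}}}(h' \| h) + B_{F_{\hat{P}}}(h \| h').$$

By the definition of $h$ and $h'$ as the minimizers of $F_{\hat{Q}}$ and
$F_{\hat{P}}$, $\nabla F_{\hat{Q}}(h) = \nabla F_{\hat{P}}(h') = 0$ and

$$\begin{aligned}
B_{F_{\hat{Q}}}(h' \| h) + B_{F_{\hat{P}}}(h \| h')
&= \hat{R}_{\hat{Q}}(h') - \hat{R}_{\hat{Q}}(h) + \hat{R}_{\hat{P}}(h) - \hat{R}_{\hat{P}}(h') \\
&= \big(\mathcal{L}_{\hat{P}}(h, f_P) - \mathcal{L}_{\hat{Q}}(h, f_P)\big) - \big(\mathcal{L}_{\hat{P}}(h', f_P) - \mathcal{L}_{\hat{Q}}(h', f_P)\big) \\
&\le 2\,\mathrm{disc}_L(\hat{P}, \hat{Q}).
\end{aligned}$$

This last inequality holds since by assumption $f_P$ is in $H$. ■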

We will say that a loss function $L$ is $\sigma$-admissible when
there exists $\sigma \in \mathbb{R}_+$ such that for any two hypotheses $h, h' \in H$
and for all $x \in X$ and $y \in Y$,

$$|L(h(x), y) - L(h'(x), y)| \le \sigma\,|h(x) - h'(x)|. \quad (18)$$

This assumption holds for the hinge loss with $\sigma = 1$ and
for the $L_q$ loss with $\sigma = q(2M)^{q-1}$ when the hypothesis set
and the set of output labels are bounded by some $M \in \mathbb{R}_+$:
$\forall h \in H, \forall x \in X, |h(x)| \le M$ and $\forall y \in Y, |y| \le M$.



Theorem 11 Let $K \colon X \times X \to \mathbb{R}$ be a positive-definite symmetric
kernel such that $K(x, x) \le \kappa^2 < \infty$ for all $x \in X$,
and let $H$ be the reproducing kernel Hilbert space associated
to $K$. Assume that the loss function $L$ is $\sigma$-admissible.
Let $h'$ be the hypothesis returned by the regularization algorithm
based on $N(\cdot) = \|\cdot\|_K^2$ for the empirical distribution
$\hat{P}$, and $h$ the one returned for the empirical distribution $\hat{Q}$,
and assume that $f_P$ and $f_Q$ coincide on $\mathrm{supp}(\hat{Q})$. Then, for
all $x \in X$, $y \in Y$,

$$|L(h'(x), y) - L(h(x), y)| \le \kappa\sigma\sqrt{\frac{\mathrm{disc}_L(\hat{P}, \hat{Q})}{\lambda}}. \quad (19)$$


Proof: For $N(\cdot) = \|\cdot\|_K^2$, $N$ is a proper closed convex function
and is differentiable. We have $B_N(h' \| h) = \|h' - h\|_K^2$,
thus $B_N(h' \| h) + B_N(h \| h') = 2\|\Delta h\|_K^2$. When $L$ is differentiable,
by Lemma 10,

$$2\|\Delta h\|_K^2 \le \frac{2\,\mathrm{disc}_L(\hat{P}, \hat{Q})}{\lambda}. \quad (20)$$

This result can also be shown directly without assuming that
$L$ is differentiable by using the convexity of $N$ and the minimizing
properties of $h$ and $h'$ with a proof that is longer than
that of Lemma 10.

Now, by the reproducing property of $H$, for all $x \in X$,
$\Delta h(x) = \langle \Delta h, K(x, \cdot) \rangle$ and by the Cauchy-Schwarz inequality,
$|\Delta h(x)| \le \|\Delta h\|_K (K(x, x))^{1/2} \le \kappa\|\Delta h\|_K$. By
the $\sigma$-admissibility of $L$, for all $x \in X$, $y \in Y$,

$$|L(h'(x), y) - L(h(x), y)| \le \sigma\,|\Delta h(x)| \le \kappa\sigma\,\|\Delta h\|_K,$$

which, combined with (20), proves the statement of the theorem. ■

Theorem 11 provides a guarantee on the pointwise difference
of the loss for $h'$ and $h$ with probability one, which
of course is stronger than a bound on the difference between 
expected losses or a probabilistic statement. The result, as 
well as the proof, also suggests that the discrepancy distance 



is the "right" measure of difference of distributions for this 
context. The theorem applies to a variety of algorithms, in 
particular SVMs combined with arbitrary PDS kernels and 
kernel ridge regression. 

In general, the functions $f_P$ and $f_Q$ may not coincide on
supp(Q). For adaptation to be possible, it is reasonable to
assume however that

$$\mathcal{L}_Q(f_Q(x), f_P(x)) \ll 1 \quad\text{and}\quad \mathcal{L}_P(f_Q(x), f_P(x)) \ll 1.$$

This can be viewed as a condition on the proximity of the 
labeling functions (the Ys), while the discrepancy distance 
relates to the distributions on the input space (the Xs). The 
following result generalizes Theorem 11 to this setting in the 
case of the square loss. 

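Theorem 12 Under the assumptions of Theorem 11, but with
$f_Q$ and $f_P$ potentially different on $\mathrm{supp}(\hat{Q})$, when $L$ is the
square loss $L_2$ and $\delta^2 = \mathcal{L}_{\hat{Q}}(f_Q(x), f_P(x)) \ll 1$, then, for
all $x \in X$, $y \in Y$,

$$|L(h'(x), y) - L(h(x), y)| \le \frac{2\kappa M}{\lambda}\Big(\kappa\delta + \sqrt{\kappa^2\delta^2 + 4\lambda\,\mathrm{disc}_L(\hat{P}, \hat{Q})}\Big). \quad (21)$$
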



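Proof: Proceeding as in the proof of Lemma 10 and using
the definition of the square loss and the Cauchy-Schwarz inequality
give

$$\begin{aligned}
B_{F_{\hat{Q}}}(h' \| h) + B_{F_{\hat{P}}}(h \| h')
&= \hat{R}_{\hat{Q}}(h') - \hat{R}_{\hat{Q}}(h) + \hat{R}_{\hat{P}}(h) - \hat{R}_{\hat{P}}(h') \\
&= \big(\mathcal{L}_{\hat{P}}(h, f_P) - \mathcal{L}_{\hat{Q}}(h, f_P)\big) - \big(\mathcal{L}_{\hat{P}}(h', f_P) - \mathcal{L}_{\hat{Q}}(h', f_P)\big) \\
&\quad + 2\operatorname*{E}_{\hat{Q}}\big[(h'(x) - h(x))(f_P(x) - f_Q(x))\big] \\
&\le 2\,\mathrm{disc}_L(\hat{P}, \hat{Q}) + 2\sqrt{\operatorname*{E}_{\hat{Q}}[\Delta h(x)^2]\,\operatorname*{E}_{\hat{Q}}[L(f_P(x), f_Q(x))]} \\
&\le 2\,\mathrm{disc}_L(\hat{P}, \hat{Q}) + 2\kappa\,\|\Delta h\|_K\,\delta.
\end{aligned}$$

Since $N(\cdot) = \|\cdot\|_K^2$, the inequality can be rewritten as

$$\lambda\|\Delta h\|_K^2 \le \mathrm{disc}_L(\hat{P}, \hat{Q}) + \kappa\delta\,\|\Delta h\|_K. \quad (22)$$

Solving the second-degree polynomial in $\|\Delta h\|_K$ leads to
the equivalent constraint

$$\|\Delta h\|_K \le \frac{1}{2\lambda}\Big(\kappa\delta + \sqrt{\kappa^2\delta^2 + 4\lambda\,\mathrm{disc}_L(\hat{P}, \hat{Q})}\Big). \quad (23)$$

The result then follows by the $\sigma$-admissibility of $L$ as in the
proof of Theorem 11, with $\sigma = 4M$. ■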
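The step from the quadratic inequality $\lambda t^2 \le \mathrm{disc} + \kappa\delta\, t$ in $t = \|\Delta h\|_K$ to an explicit bound on $t$ is an application of the quadratic formula. The sketch below checks this numerically: it returns the positive root and verifies that the inequality is tight there and violated just beyond it. The parameter values and the function name are hypothetical.

```python
import math


def delta_h_norm_bound(kappa, delta, lam, disc):
    """Largest t >= 0 satisfying lam * t**2 <= disc + kappa * delta * t.

    This is the positive root of lam * t**2 - kappa * delta * t - disc = 0.
    """
    kd = kappa * delta
    return (kd + math.sqrt(kd * kd + 4.0 * lam * disc)) / (2.0 * lam)


# Hypothetical values for the kernel bound, label mismatch, regularization
# parameter, and empirical discrepancy.
kappa, delta, lam, disc = 1.0, 0.1, 0.5, 0.02

t_max = delta_h_norm_bound(kappa, delta, lam, disc)

# At the root the inequality holds with equality (up to rounding).
residual = lam * t_max ** 2 - (disc + kappa * delta * t_max)

# Slightly beyond the root the inequality fails.
t_over = t_max + 1e-3
violated = lam * t_over ** 2 > disc + kappa * delta * t_over
```

When both the discrepancy and the label mismatch are zero the bound collapses to zero, so the two hypotheses coincide in that idealized case.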

Using the same proof schema, similar bounds can be de- 
rived for other loss functions. 

When the assumption $f_P \in H$ is relaxed, the following
theorem holds. 

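Theorem 13 Under the assumptions of Theorem 11, but with
$f_P$ not necessarily in $H$ and $f_Q$ and $f_P$ potentially different
on $\mathrm{supp}(\hat{Q})$, when $L$ is the square loss $L_2$ and
$\delta' = \mathcal{L}_{\hat{Q}}(h_P^*(x), f_Q(x))^{1/2} + \mathcal{L}_{\hat{P}}(h_P^*(x), f_P(x))^{1/2} \ll 1$, then,
for all $x \in X$, $y \in Y$,

$$|L(h'(x), y) - L(h(x), y)| \le \frac{2\kappa M}{\lambda}\Big(\kappa\delta' + \sqrt{\kappa^2\delta'^2 + 4\lambda\,\mathrm{disc}_L(\hat{P}, \hat{Q})}\Big). \quad (24)$$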

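Proof: Proceeding as in the proof of Theorem 12 and using
the definition of the square loss and the Cauchy-Schwarz
inequality give

$$\begin{aligned}
B_{F_{\hat{Q}}}(h' \| h) + B_{F_{\hat{P}}}(h \| h')
&= \big(\mathcal{L}_{\hat{P}}(h, h_P^*) - \mathcal{L}_{\hat{Q}}(h, h_P^*)\big) - \big(\mathcal{L}_{\hat{P}}(h', h_P^*) - \mathcal{L}_{\hat{Q}}(h', h_P^*)\big) \\
&\quad - 2\operatorname*{E}_{\hat{P}}\big[(h'(x) - h(x))(h_P^*(x) - f_P(x))\big] \\
&\quad + 2\operatorname*{E}_{\hat{Q}}\big[(h'(x) - h(x))(h_P^*(x) - f_Q(x))\big] \\
&\le 2\,\mathrm{disc}_L(\hat{P}, \hat{Q}) + 2\sqrt{\operatorname*{E}_{\hat{P}}[\Delta h(x)^2]\,\operatorname*{E}_{\hat{P}}[L(h_P^*(x), f_P(x))]} \\
&\quad + 2\sqrt{\operatorname*{E}_{\hat{Q}}[\Delta h(x)^2]\,\operatorname*{E}_{\hat{Q}}[L(h_P^*(x), f_Q(x))]} \\
&\le 2\,\mathrm{disc}_L(\hat{P}, \hat{Q}) + 2\kappa\,\|\Delta h\|_K\,\delta'.
\end{aligned}$$

The rest of the proof is identical to that of Theorem 12. ■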

5 Discrepancy Minimization Algorithms 

The discrepancy distance $\mathrm{disc}_L(\hat{P}, \hat{Q})$ appeared as a critical
term in several of the bounds in the last section. In particular,
Theorems 11 and 12 suggest that if we could select, instead
of $\hat{Q}$, some other empirical distribution $\hat{Q}'$ with a smaller
empirical discrepancy $\mathrm{disc}_L(\hat{P}, \hat{Q}')$ and use that for training
a regularization-based algorithm, a better guarantee would
be obtained on the difference of pointwise loss between $h'$
and $h$. Since $h'$ is fixed, a sufficiently smaller discrepancy
would actually lead to a hypothesis $h$ with pointwise loss
closer to that of $h'$.

The training sample is given and we do not have any control
over the support of $\hat{Q}$. But, we can search for the distribution
$\hat{Q}'$ with the minimal empirical discrepancy distance:

$$\hat{Q}' = \operatorname*{argmin}_{Q' \in \mathcal{Q}} \mathrm{disc}_L(\hat{P}, Q'), \quad (25)$$

where $\mathcal{Q}$ denotes the set of distributions with support $\mathrm{supp}(\hat{Q})$.
This leads to an optimization problem that we shall study in
detail in the case of several loss functions.

Note that using Q' instead of Q for training can be viewed 
as reweighting the cost of an error on each training point. 
The distribution Q' can be used to emphasize some points 
or de-emphasize others to reduce the empirical discrepancy 
distance. This bears some similarity with the reweighting or 
importance weighting ideas used in statistics and machine 
learning for sample bias correction techniques (Elkan, 2001;
Cortes et al., 2008) and other purposes. Of course, the ob- 
jective optimized here based on the discrepancy distance is 
distinct from that of previous reweighting techniques. 



We will denote by $S_Q$ the support of $\hat{Q}$, by $S_P$ the support
of $\hat{P}$, and by $S$ their union $\mathrm{supp}(\hat{Q}) \cup \mathrm{supp}(\hat{P})$, with
$|S_Q| = m_0 \le m$ and $|S_P| = n_0 \le n$.

In view of the definition of the discrepancy distance, problem
(25) can be written as a min-max problem:

$$Q' = \operatorname*{argmin}_{Q' \in \mathcal{Q}} \max_{h, h' \in H} \big|\mathcal{L}_P(h', h) - \mathcal{L}_{Q'}(h', h)\big|. \tag{26}$$

As with all min-max problems, the problem has a natural 
game theoretical interpretation. However, here, in general, 
we cannot permute the min and max operators since the 
convexity-type assumptions of the minimax theorems do not 
hold. Nevertheless, since the max-min value is always a 
lower bound for the min-max, it provides us with a lower 
bound on the value of the game, that is the minimal discrep- 
ancy: 

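$$\max_{h, h' \in H} \min_{Q' \in \mathcal{Q}} \big|\mathcal{L}_P(h', h) - \mathcal{L}_{Q'}(h', h)\big| \le \min_{Q' \in \mathcal{Q}} \max_{h, h' \in H} \big|\mathcal{L}_P(h', h) - \mathcal{L}_{Q'}(h', h)\big|. \tag{27}$$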

We will later make use of this inequality. Let us now examine 
the minimization problem (25) and its algorithmic solutions 
in the case of classification with the 0-1 loss and regression 
with the L2 loss. 
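
As a small sanity check of inequality (27), the following Python sketch evaluates both sides on a toy instance. Everything in it is an assumption made for illustration: three abstract points, a fixed target weighting P, two candidate reweightings playing the role of Q', and a handful of regions playing the role of the pairs (h, h') under the 0-1 loss.

```python
from fractions import Fraction

# Hypothetical target weighting P over three points.
P = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}

# Two hypothetical candidates for Q'.
candidates = [
    {"a": Fraction(1, 3), "b": Fraction(1, 3), "c": Fraction(1, 3)},
    {"a": Fraction(1, 2), "b": Fraction(1, 2), "c": Fraction(0)},
]

# Regions over which the gap |Q'(region) - P(region)| is measured.
regions = [("a",), ("b",), ("c",), ("a", "b"), ("b", "c")]

def gap(q, region):
    """|Q'(region) - P(region)| for one candidate and one region."""
    return abs(sum(q[x] for x in region) - sum(P[x] for x in region))

min_max = min(max(gap(q, r) for r in regions) for q in candidates)
max_min = max(min(gap(q, r) for q in candidates) for r in regions)
```

On this instance the max-min value is 1/12 and the min-max value is 1/6, so the inequality is strict, which is consistent with the remark that the two operators cannot be permuted in general.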

5.1 Classification, 0-1 Loss 

For the 0-1 loss, the problem of finding the best distribution
Q' can be reformulated as the following min-max program:

$$\min_{Q'} \max_{a \in H \Delta H} |Q'(a) - P(a)| \tag{28}$$

$$\text{subject to } \forall x \in S_Q,\ Q'(x) \ge 0 \ \wedge \sum_{x \in S_Q} Q'(x) = 1, \tag{29}$$

where we have identified $H \Delta H = \{|h' - h| : h, h' \in H\}$
with the set of regions $a \subseteq X$ that are the support of an
element of $H \Delta H$. This problem is similar to the min-max
resource allocation problem that arises in task optimization
(Karabati et al., 2001). It can be rewritten as the following
linear program (LP):

$$\min_{Q'} \delta \tag{30}$$

$$\text{subject to } \forall a \in H \Delta H,\ Q'(a) - P(a) \le \delta \tag{31}$$

$$\forall a \in H \Delta H,\ P(a) - Q'(a) \le \delta \tag{32}$$

$$\forall x \in S_Q,\ Q'(x) \ge 0 \ \wedge \sum_{x \in S_Q} Q'(x) = 1. \tag{33}$$

The number of constraints is proportional to $|H \Delta H|$ but it
can be reduced to a finite number by observing that two subsets
$a, a' \in H \Delta H$ containing the same elements of S lead to
redundant constraints, since

$$|Q'(a) - P(a)| = |Q'(a') - P(a')|. \tag{34}$$

Thus, it suffices to keep one canonical member a for each
such equivalence class. The necessary number of constraints
to be considered is proportional to $\Pi_{H \Delta H}(m_0 + n_0)$, the
shattering coefficient of order $(m_0 + n_0)$ of the hypothesis
class $H \Delta H$. By Sauer's lemma, this is bounded in terms
of the VC-dimension of the class $H \Delta H$:
$\Pi_{H \Delta H}(m_0 + n_0) \le O((m_0 + n_0)^{VC(H \Delta H)})$, which can be
bounded by $O((m_0 + n_0)^{2VC(H)})$ since it is not hard to see that
$VC(H \Delta H) \le 2VC(H)$.

Figure 2: Illustration of the discrepancy minimization algorithm
in dimension one. (a) Sequence of labeled (red) and
unlabeled (blue) points. (b) The weight assigned to each labeled
point is the sum of the weights of the consecutive blue
points on its right.
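
The reduction to canonical regions and the shattering-coefficient bound can be checked numerically on a small case. The Python sketch below assumes the one-dimensional class in which H consists of prefixes and suffixes of the line, so that the regions are intervals and their complements; the sample points are hypothetical. It enumerates the distinct restrictions of these regions to the sample and compares their number with the Sauer bound of order 2VC(H), using VC(H) = 2 for this particular class.

```python
from math import comb

def canonical_regions(points):
    """Distinct restrictions to `points` of intervals and of their complements."""
    pts = sorted(points)
    full = frozenset(pts)
    regions = {frozenset()}                       # an interval that misses every point
    for i in range(len(pts)):
        for j in range(i, len(pts)):
            regions.add(frozenset(pts[i:j + 1]))  # sample points inside some interval
    regions |= {full - r for r in regions}        # complements of intervals
    return regions

def sauer_bound(d, p):
    """Sum_{i <= d} C(p, i), the Sauer bound on the shattering coefficient of order p."""
    return sum(comb(p, i) for i in range(d + 1))

sample = [0.1, 0.25, 0.4, 0.7, 0.9]   # hypothetical S with m0 + n0 = 5 points
regions = canonical_regions(sample)
vc_H = 2                              # prefixes and suffixes shatter two points but not three
bound = sauer_bound(2 * vc_H, len(sample))
```

For five points this yields 22 canonical regions, below the bound of 31 and well below the 32 possible subsets, so each of the constraint families (31) and (32) needs only 22 rows in this toy case.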

In cases where we can test efficiently whether there exists
a consistent hypothesis in H, e.g., for half-spaces in $\mathbb{R}^d$, we
can generate in time $O((m_0 + n_0)^{2d})$ all consistent labelings
of the sample points by H. (We remark that computing the
discrepancy with the 0-1 loss is closely related to agnostic
learning. The implications of this fact will be described in a
longer version of this paper.)

5.2 Computing the Discrepancy in 1D

We consider the case where X = [0, 1] and derive a simple
algorithm for minimizing the discrepancy for 0-1 loss. Let
H be the class of all prefixes (i.e., [0, z]) and suffixes (i.e.,
[z, 1]). Our class $H \Delta H$ includes all the intervals (i.e.,
$(z_1, z_2]$) and their complements (i.e., $[0, z_1] \cup (z_2, 1]$). We
start with a general lower bound on the discrepancy.

Let U denote the set of unlabeled regions, that is the set
of regions a such that $a \cap S_Q = \emptyset$ and $a \cap S_P \ne \emptyset$. If a is an
unlabeled region, then $|Q'(a) - P(a)| = P(a)$ for any Q'.
Thus, by the max-min inequality (27), the following lower
bound holds for the minimum discrepancy:

$$\max_{a \in U} P(a) \le \min_{Q' \in \mathcal{Q}} \max_{h, h' \in H} \big|\mathcal{L}_P(h', h) - \mathcal{L}_{Q'}(h', h)\big|. \tag{35}$$

In particular, if there is a large unlabeled region a, we cannot
hope to achieve a small empirical discrepancy.

In the one-dimensional case, we give a simple linear-time
algorithm that does not require an LP and show that the lower
bound (35) is reached. Thus, in that case, the min and max
operators commute and the minimal discrepancy distance is
precisely $\max_{a \in U} P(a)$.

Given our definition of H, the unlabeled regions are open
intervals, or complements of these sets, containing only points
from $S_P$ with endpoints defined by elements of $S_Q$.

Let us denote by $s_1, \ldots, s_{m_0}$ the elements of $S_Q$, by $n_i$,
$i \in [1, m_0]$, the number of consecutive unlabeled points to
the right of $s_i$, and $n = \sum_{i=1}^{m_0} n_i$. We will make an additional
technical assumption that there are no unlabeled points to
the left of $s_1$. Our algorithm consists of defining the weight
$Q'(s_i)$ as follows:

$$Q'(s_i) = n_i / n. \tag{36}$$

This requires first sorting $S_Q \cup S_P$ and then computing $n_i$
for each $s_i$. Figure 2 illustrates the algorithm.

Proposition 3 Assume that X consists of the set of points on
the real line and H the set of half-spaces on X. Then, for any
Q and P, $Q'(s_i) = n_i/n$ minimizes the empirical discrepancy
and can be computed in time $O((m + n) \log(m + n))$.

Proof: Consider an interval $[z_1, z_2]$ that maximizes the
discrepancy of Q'. The case of a complement of an interval is
the same, since the discrepancy of a hypothesis and its negation
are identical. Let $s_i, \ldots, s_j \in [z_1, z_2]$ be the subset of
Q in that interval, and $p_{i'}, \ldots, p_{j'} \in [z_1, z_2]$ be the subset of
P in that interval. The discrepancy is
$d = \big|\sum_{k=i}^{j} Q'(s_k) - \frac{j' - i'}{n}\big|$.
By our definition of Q', we have that
$\sum_{k=i}^{j} Q'(s_k) = \frac{1}{n} \sum_{k=i}^{j} n_k$.
Let $p_{i''}$ be the maximal point in P which is less
than $s_i$ and $p_{j''}$ the minimal point in P larger than $s_j$. We have
that $j' - i' = (i'' - i') + \sum_{k=i}^{j-1} n_k + (j' - j'')$. Therefore

$$d = \frac{\big|(i'' - i') + (j' - j'') - n_j\big|}{n} = \frac{\big|(i'' - i') - (n_j - (j' - j''))\big|}{n}.$$

Since d is maximal and both terms are non-negative, one of
them is zero. Since $j' - j'' \le n_j$ and $i'' - i' \le n_i$, the
discrepancy of Q' meets the lower bound of (35) and is thus
optimal. ■
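
The one-dimensional procedure of (36) is short enough to sketch in Python. The sketch below is an illustration under stated assumptions rather than the authors' implementation: points are distinct, labeled and unlabeled points do not coincide, P is taken to be uniform over the unlabeled sample, and the function names and the data are hypothetical. A brute-force routine over all intervals is included only to compare the resulting discrepancy with the lower bound (35).

```python
from fractions import Fraction

def discrepancy_weights(labeled, unlabeled):
    """Q'(s_i) = n_i / n, where n_i counts the unlabeled points between s_i and s_{i+1}."""
    s = sorted(labeled)
    if min(unlabeled) < s[0]:
        raise ValueError("assumes no unlabeled point lies to the left of s_1")
    counts = [0] * len(s)
    i = 0
    for p in sorted(unlabeled):
        while i + 1 < len(s) and s[i + 1] < p:   # move to the last labeled point left of p
            i += 1
        counts[i] += 1
    n = len(unlabeled)
    return {si: Fraction(c, n) for si, c in zip(s, counts)}

def interval_discrepancy(weights, unlabeled):
    """max over intervals of |Q'(a) - P(a)|; complements of intervals give the same values."""
    n = len(unlabeled)
    signed = sorted(list(weights.items()) + [(x, -Fraction(1, n)) for x in unlabeled])
    best = Fraction(0)
    for i in range(len(signed)):
        total = Fraction(0)
        for j in range(i, len(signed)):
            total += signed[j][1]                # Q' mass minus P mass of points i..j
            best = max(best, abs(total))
    return best

labeled = [0, 1, 2, 3, 5, 7]      # hypothetical S_Q
unlabeled = [4, 6, 8, 9]          # hypothetical S_P

weights = discrepancy_weights(labeled, unlabeled)
lower_bound = max(weights.values())                              # largest unlabeled region, as in (35)
reweighted_value = interval_discrepancy(weights, unlabeled)
uniform_value = interval_discrepancy({s: Fraction(1, len(labeled)) for s in labeled}, unlabeled)
```

On this sample the weights are (0, 0, 0, 1/4, 1/4, 1/2), the discrepancy of Q' is 1/2 and matches the lower bound, while the uniform weighting of the labeled points gives 2/3.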

5.3 Regression, L2 loss 

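For the square loss, the problem of finding the best distribution
can be written as

$$\min_{Q' \in \mathcal{Q}} \max_{h, h' \in H} \Big| \mathop{\mathrm{E}}_{P}\big[(h'(x) - h(x))^2\big] - \mathop{\mathrm{E}}_{Q'}\big[(h'(x) - h(x))^2\big] \Big|.$$

If X is a subset of $\mathbb{R}^N$, N > 1, and the hypothesis set H is
a set of bounded linear functions $H = \{x \mapsto w^\top x : \|w\| \le 1\}$,
then the problem can be rewritten as

$$
\begin{aligned}
&\min_{Q' \in \mathcal{Q}} \max_{\substack{\|w\| \le 1 \\ \|w'\| \le 1}} \Big| \mathop{\mathrm{E}}_{P}\big[((w' - w)^\top x)^2\big] - \mathop{\mathrm{E}}_{Q'}\big[((w' - w)^\top x)^2\big] \Big| \\
&= \min_{Q' \in \mathcal{Q}} \max_{\substack{\|w\| \le 1 \\ \|w'\| \le 1}} \Big| \sum_{x \in S} (P(x) - Q'(x)) \big[(w' - w)^\top x\big]^2 \Big| \\
&= \min_{Q' \in \mathcal{Q}} \max_{\|u\| \le 2} \Big| \sum_{x \in S} (P(x) - Q'(x)) \big[u^\top x\big]^2 \Big| \\
&= \min_{Q' \in \mathcal{Q}} \max_{\|u\| \le 2} \Big| u^\top \Big( \sum_{x \in S} (P(x) - Q'(x))\, x x^\top \Big) u \Big|.
\end{aligned}
\tag{37}
$$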

We now simplify the notation and denote by $s_1, \ldots, s_{m_0}$ the
elements of $S_Q$, by $z_i$ the distribution weight at point $s_i$:
$z_i = Q'(s_i)$, and by $M(z) \in \mathbb{R}^{N \times N}$ a symmetric matrix that is
an affine function of z:

$$M(z) = M_0 - \sum_{i=1}^{m_0} z_i M_i, \tag{38}$$



where $M_0 = \sum_{x \in S} P(x)\, x x^\top$ and $M_i = s_i s_i^\top$. Since problem
(37) is invariant to the non-zero bound on $\|u\|$, we can
equivalently write it with a bound of one and in view of the
notation just introduced give its equivalent form

$$\min_{\substack{\|z\|_1 = 1 \\ z \ge 0}} \max_{\|u\| = 1} \big| u^\top M(z)\, u \big|. \tag{39}$$



Since M(z) is symmetric, $\max_{\|u\| = 1} u^\top M(z) u$ is the maximum
eigenvalue $\lambda_{\max}$ of M(z) and the problem is equivalent
to the following maximum eigenvalue minimization for
a symmetric matrix:

$$\min_{\substack{\|z\|_1 = 1 \\ z \ge 0}} \max\{\lambda_{\max}(M(z)),\ \lambda_{\max}(-M(z))\}. \tag{40}$$



This is a convex optimization problem since the maximum
eigenvalue of a matrix is a convex function of that matrix
and M is an affine function of z, and since z belongs to a
simplex. The problem is equivalent to the following
semidefinite programming (SDP) problem:

$$\min_{z, \lambda} \lambda \tag{41}$$

$$\text{subject to } \lambda I - M(z) \succeq 0 \tag{42}$$

$$\lambda I + M(z) \succeq 0 \tag{43}$$

$$\mathbf{1}^\top z = 1 \ \wedge\ z \ge 0. \tag{44}$$



SDP problems can be solved in polynomial time using gen- 
eral interior point methods (Nesterov & Nemirovsky, 1994). 
Thus, using the general expression of the complexity of inte- 
rior point methods for SDPs, the following result holds. 

Proposition 4 Assume that X is a subset of $\mathbb{R}^N$ and that
the hypothesis set H is a set of bounded linear functions
$H = \{x \mapsto w^\top x : \|w\| \le 1\}$. Then, for any Q and P, the
discrepancy minimizing distribution Q' for the square loss
can be found in time $O(m_0^2 N^{2.5} + n_0 N^2)$.

It is worth noting that the unconstrained version of this prob- 
lem (no constraint on z) and other close problems seem to 
have been studied by a number of optimization publications 
(Fletcher, 1985; Overton, 1988; Jarre, 1993; Helmberg & 
Oustry, 2000; Alizadeh, 1995). This suggests possibly more 
efficient specific algorithms than general interior point meth- 
ods for solving this problem in the constrained case as well. 
Observe also that the matrices $M_i$ have a specific structure
in our case: they are rank-one matrices and in many applications
quite sparse, which could be further exploited to improve
efficiency.
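
To make the objective of (40) concrete, the following Python sketch evaluates it for N = 2 and searches the simplex on a coarse grid. This is not the SDP of (41)-(44) and not an interior point method: the grid search is a stand-in chosen so that the example needs no solver, the closed-form eigenvalues only apply to symmetric 2×2 matrices, and the points and the weighting P are hypothetical.

```python
import math

def outer(x):
    """x x^T for a 2-dimensional point."""
    return [[x[0] * x[0], x[0] * x[1]], [x[1] * x[0], x[1] * x[1]]]

def M_of_z(z, labeled, target):
    """M(z) = M_0 - sum_i z_i s_i s_i^T with M_0 = sum_x P(x) x x^T, as in (38)."""
    m = [[0.0, 0.0], [0.0, 0.0]]
    terms = [(p, x) for x, p in target.items()] + [(-zi, s) for zi, s in zip(z, labeled)]
    for coeff, x in terms:
        o = outer(x)
        for r in range(2):
            for c in range(2):
                m[r][c] += coeff * o[r][c]
    return m

def objective(m):
    """max{lambda_max(M), lambda_max(-M)} for a symmetric 2x2 matrix M."""
    mean = (m[0][0] + m[1][1]) / 2
    radius = math.hypot((m[0][0] - m[1][1]) / 2, m[0][1])
    return abs(mean) + radius          # the eigenvalues are mean - radius and mean + radius

labeled = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]     # hypothetical S_Q
target = {(1.0, 0.0): 0.5, (0.0, 1.0): 0.5}        # hypothetical P

k = 6                                              # grid resolution on the simplex
grid = [(i / k, j / k, (k - i - j) / k) for i in range(k + 1) for j in range(k + 1 - i)]
best_z = min(grid, key=lambda z: objective(M_of_z(z, labeled, target)))
best_value = objective(M_of_z(best_z, labeled, target))
uniform_value = objective(M_of_z((1 / 3, 1 / 3, 1 / 3), labeled, target))
```

In this toy instance the target mass sits on two of the three labeled points, so the search returns z = (1/2, 1/2, 0) with objective 0, while the uniform weighting gives an objective of about 1/2. A grid search scales exponentially in $m_0$ and is only meant to illustrate the objective.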

6 Experiments 

This section reports the results of preliminary experiments 
showing the benefits of our discrepancy minimization algo- 
rithms. Our results confirm that our algorithm is effective 
in practice and produces a distribution that reduces the em- 
pirical discrepancy distance, which allows us to train on a 
sample closer to the target distribution with respect to this 
metric. They also demonstrate the accuracy benefits of this 
algorithm with respect to the target domain. 

Figures 3(a)-(b) show the empirical advantages of using
the distribution Q' returned by the discrepancy minimizing
algorithm described in Proposition 3 in a case where source
and target distributions are shifted Gaussians: the source
distribution is a Gaussian centered at $-1$ and the target
distribution a Gaussian centered at $+1$, both with standard deviation
2. The hypothesis set used was the set of half-spaces and
the target function selected to be the interval $[-1, 1]$. Thus,
training on a sample drawn from Q generates a separator
at $-1$ and errs on about half of the test points produced by
P. In contrast, training with the distribution Q' minimizing
the empirical discrepancy yields a hypothesis separating the
points at $+1$, thereby dramatically reducing the error rate.

Figure 3: Example of application of the discrepancy minimization
algorithm in dimension one. (a) Source and target
distributions Q and P. (b) Classification accuracy empirical
results plotted as a function of the number of training
points for both the unweighted case (using original empirical
distribution Q) and the weighted case (using distribution
Q' returned by our discrepancy minimizing algorithm). The
number of unlabeled points used was ten times the number
of labeled. Error bars show ±1 standard deviation.

Figure 4: (a) An $(x_1, x_2, y)$ plot of Q (magenta), P (green),
weighted (red) and unweighted (blue) hypothesis. (b) Comparison
of mean-squared error for the hypothesis trained on
Q (top), trained on Q' (middle) and on P (bottom) over a
varying number of training points.

Figures 4(a)-(b) show the application of the SDP derived
in (41) to determining the distribution minimizing the empirical
discrepancy for ridge regression. In Figure 4(a), the
distributions Q and P are Gaussians centered at $(\sqrt{2}, \sqrt{2})$
and $(-\sqrt{2}, -\sqrt{2})$, both with covariance matrix $2I$. The target
function is $f(x_1, x_2) = (1 - |x_1|) + (1 - |x_2|)$, thus
the optimal linear prediction derived from Q has a negative
slope, while the optimal prediction with respect to the target
distribution P in fact has a positive slope. Figure 4(b) shows
the performance of ridge regression when the example is extended
to 16 dimensions, before and after minimizing the
discrepancy. In this higher-dimensional setting and even with
several thousand points, using SeDuMi (http://sedumi.ie.lehigh.edu/), our
SDP problem could be solved in about 15s using a single
3GHz processor with 2GB RAM. The SDP algorithm yields
distribution weights that decrease the discrepancy and assist
ridge regression in selecting a more appropriate hypothesis
for the target distribution.



7 Conclusion 

We presented an extensive theoretical and an algorithmic anal- 
ysis of domain adaptation. Our analysis and algorithms are 
widely applicable and can benefit a variety of adaptation tasks. 
More efficient versions of these algorithms, in some instances 
efficient approximations, should further extend the applica- 
bility of our techniques to large-scale adaptation problems. 

A Proof of Theorem 2 

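Proof: Let $\Phi(S)$ be defined by $\Phi(S) = \sup_{h \in H} R(h) - \widehat{R}_S(h)$.
Changing a point of S affects $\Phi(S)$ by at most 1/m.
Thus, by McDiarmid's inequality applied to $\Phi(S)$, for any
$\delta > 0$, with probability at least $1 - \delta/2$, the following holds
for all $h \in H$:

$$\Phi(S) \le \mathop{\mathrm{E}}_{S \sim D}[\Phi(S)] + \sqrt{\frac{\log \frac{2}{\delta}}{2m}}. \tag{45}$$

$\mathrm{E}_{S \sim D}[\Phi(S)]$ can be bounded in terms of the empirical
Rademacher complexity as follows:

$$
\begin{aligned}
\mathop{\mathrm{E}}_{S}[\Phi(S)]
&= \mathop{\mathrm{E}}_{S}\Big[\sup_{h \in H} \mathop{\mathrm{E}}_{S'}\big[\widehat{R}_{S'}(h)\big] - \widehat{R}_{S}(h)\Big] \\
&= \mathop{\mathrm{E}}_{S}\Big[\sup_{h \in H} \mathop{\mathrm{E}}_{S'}\big[\widehat{R}_{S'}(h) - \widehat{R}_{S}(h)\big]\Big] \\
&\le \mathop{\mathrm{E}}_{S, S'}\Big[\sup_{h \in H} \widehat{R}_{S'}(h) - \widehat{R}_{S}(h)\Big] \\
&= \mathop{\mathrm{E}}_{S, S'}\Big[\sup_{h \in H} \frac{1}{m} \sum_{i=1}^{m} \big(h(x'_i) - h(x_i)\big)\Big] \\
&= \mathop{\mathrm{E}}_{\sigma, S, S'}\Big[\sup_{h \in H} \frac{1}{m} \sum_{i=1}^{m} \sigma_i \big(h(x'_i) - h(x_i)\big)\Big] \\
&\le \mathop{\mathrm{E}}_{\sigma, S'}\Big[\sup_{h \in H} \frac{1}{m} \sum_{i=1}^{m} \sigma_i h(x'_i)\Big] + \mathop{\mathrm{E}}_{\sigma, S}\Big[\sup_{h \in H} \frac{1}{m} \sum_{i=1}^{m} -\sigma_i h(x_i)\Big] \\
&= 2 \mathop{\mathrm{E}}_{\sigma, S}\Big[\sup_{h \in H} \frac{1}{m} \sum_{i=1}^{m} \sigma_i h(x_i)\Big]
\le 2 \mathop{\mathrm{E}}_{\sigma, S}\Big[\sup_{h \in H} \Big|\frac{1}{m} \sum_{i=1}^{m} \sigma_i h(x_i)\Big|\Big] \\
&= \mathfrak{R}_m(H).
\end{aligned}
$$

Changing a point of S affects $\widehat{\mathfrak{R}}_S(H)$ by at most 2/m. Thus,
by McDiarmid's inequality applied to $\widehat{\mathfrak{R}}_S(H)$, with probability
at least $1 - \delta/2$, the following holds:

$$\mathfrak{R}_m(H) \le \widehat{\mathfrak{R}}_S(H) + \sqrt{\frac{2 \log \frac{2}{\delta}}{m}}. \tag{46}$$

Combining this inequality with Inequality (45) and the bound
on $\mathrm{E}_S[\Phi(S)]$ above yields directly the statement of the theorem.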
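
For very small samples the empirical Rademacher complexity used in this proof can be computed exactly by enumerating all sign vectors. The Python sketch below assumes the normalization implied by the chain of inequalities above, namely $\widehat{\mathfrak{R}}_S(H) = \mathrm{E}_\sigma[\sup_h |\frac{2}{m}\sum_i \sigma_i h(x_i)|]$, and uses a hypothetical class (prefix indicators restricted to three sample points). It also evaluates the deviation term of (45).

```python
from fractions import Fraction
from itertools import product
import math

def empirical_rademacher(behaviours):
    """E_sigma[ sup_h |(2/m) sum_i sigma_i h(x_i)| ], by enumerating every sign vector."""
    m = len(behaviours[0])
    total = Fraction(0)
    for sigma in product((-1, 1), repeat=m):
        total += max(abs(sum(s * v for s, v in zip(sigma, h))) for h in behaviours)
    return Fraction(2, m) * total / 2 ** m

def deviation_term(m, delta):
    """sqrt(log(2/delta) / (2m)), the McDiarmid term appearing in (45)."""
    return math.sqrt(math.log(2 / delta) / (2 * m))

# Hypothetical class: prefixes h_z(x) = 1[x <= z], listed by their values on three sorted points.
prefix_behaviours = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)]
rad = empirical_rademacher(prefix_behaviours)
```

The enumeration costs $2^m$ evaluations, so it is only useful as a check on tiny samples; for realistic m one would sample sign vectors instead.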



B Rademacher Classification Bound 

Theorem 14 (Rademacher Classification Bound) Let H be
a family of functions mapping X to {0, 1} and let $L_{01}$ denote
the 0-1 loss. Let Q be a distribution over X. Then, for any
$\delta > 0$, with probability at least $1 - \delta$, the following inequality
holds for all samples S of size m drawn according to Q:

$$\mathcal{L}^{01}_{Q}(h, h^*_Q) \le \mathcal{L}^{01}_{\widehat{Q}}(h, h^*_Q) + \widehat{\mathfrak{R}}_S(H)/2 + \cdots \tag{47}$$



C Discrepancy Minimization with Kernels 
and L2 loss 

Here, we show how to generalize the results of Section 5.3 to 
the high-dimensional case where H is the reproducing ker- 
nel Hilbert space associated to a positive definite symmetric 
(PDS) kernel K. 

Proposition 5 Let K be a PDS kernel and let H denote the
reproducing kernel Hilbert space associated to K. Then, for
any Q and P, the problem of determining the discrepancy
minimizing distribution Q' for the square loss can be cast as
an SDP depending only on the Gram matrix of the kernel
function K and solved in time $O(m_0^2 (m_0 + n_0)^{2.5} + n_0 (m_0 + n_0)^2)$.


Proof: Let $\Phi: X \to \mathbb{R}^N$ be a feature mapping associated
with K. Let $p_0 = m_0 + n_0$. Here, we denote by $s_1, \ldots, s_{m_0}$
the elements of $S_Q$ and by $s_{m_0+1}, \ldots, s_{p_0}$ the elements of
$S_P$. We also define $z_i = Q'(s_i)$ for $i \in [1, m_0]$, and for
convenience $z_i = 0$ for $i \in [m_0 + 1, m_0 + n_0]$. Then, by
Proposition 4, the problem of finding the optimal distribution
Q' is equivalent to

$$\min_{\substack{\|z\|_1 = 1 \\ z \ge 0}} \max\{\lambda_{\max}(M(z)),\ \lambda_{\max}(-M(z))\}, \tag{48}$$

where $M(z) = \sum_{i=1}^{p_0} (P(s_i) - z_i)\, \Phi(s_i) \Phi(s_i)^\top$. Let $\Psi$ denote
the matrix in $\mathbb{R}^{N \times p_0}$ whose columns are the vectors
$\Phi(s_1), \ldots, \Phi(s_{m_0 + n_0})$. Then, observe that M(z) can be
rewritten as

$$M(z) = \Psi A \Psi^\top, \tag{49}$$

where A is the diagonal matrix

$$A = \mathrm{diag}\big(P(s_1) - z_1, \ldots, P(s_{p_0}) - z_{p_0}\big). \tag{50}$$

Fix z. There exists $t_0 \in \mathbb{R}$ such that, for all $t \ge t_0$,
$B = A + tI$ is a positive definite symmetric matrix. For any such
t, let N'(z) denote

$$N'(z) = \Psi B \Psi^\top. \tag{51}$$

Since B is positive definite, there exists a diagonal matrix
$B^{1/2} \in \mathbb{R}^{p_0 \times p_0}$ such that $B = B^{1/2} B^{1/2}$. Thus, we can
write N'(z) as $N'(z) = Y Y^\top$ with $Y = \Psi B^{1/2}$. $Y Y^\top$
and $Y^\top Y$ have the same characteristic polynomial modulo
multiplication by $X^{N - p_0}$. Thus, since $\Psi^\top \Psi = K$, the
Gram matrix of kernel K for the sample S, N'(z) has the
same characteristic polynomial modulo multiplication
by $X^{N - p_0}$ as

$$N''(z) = Y^\top Y = B^{1/2} K B^{1/2}. \tag{52}$$

Now, N''(z) can be rewritten as $N''(z) = Z Z^\top$ with
$Z = B^{1/2} K^{1/2}$. Using the fact that $Z Z^\top$ and $Z^\top Z$ have the
same characteristic polynomial, this shows that N'(z) has
the same characteristic polynomial modulo multiplication by
$X^{N - p_0}$ as

$$N'''(z) = K^{1/2} B K^{1/2}. \tag{53}$$

Thus, assuming without loss of generality that $N > p_0$, the
following equality between polynomials in X holds for all
$t \ge t_0$:

$$\det(X I - \Psi B \Psi^\top) = X^{N - p_0} \det(X I - K^{1/2} B K^{1/2}). \tag{54}$$

Both determinants are also polynomials in t. Thus, for every
fixed value of X, this is an equality between two polynomials
in t for all $t \ge t_0$. Thus, the equality holds for all t, in
particular for t = 0, which implies that $M(z) = \Psi A \Psi^\top$ has
the same non-zero eigenvalues as $M'(z) = K^{1/2} A K^{1/2}$.
Thus, problem (48) is equivalent to

$$\min_{\substack{\|z\|_1 = 1 \\ z \ge 0}} \max\{\lambda_{\max}(M'(z)),\ \lambda_{\max}(-M'(z))\}. \tag{55}$$

Let $A_0$ denote the diagonal matrix

$$A_0 = \mathrm{diag}\big(P(s_1), \ldots, P(s_{p_0})\big), \tag{56}$$

and for $i \in [1, m_0]$, let $I_i \in \mathbb{R}^{p_0 \times p_0}$ denote the diagonal
matrix whose diagonal entries are all zero except for the
ith one, which equals one. Then,

$$M'(z) = M'_0 - \sum_{i=1}^{m_0} z_i M'_i \tag{57}$$

with $M'_0 = K^{1/2} A_0 K^{1/2}$ and $M'_i = K^{1/2} I_i K^{1/2}$ for $i \in [1, m_0]$.
Thus, M'(z) is an affine function of z and problem
(55) is a convex optimization problem that can be cast as an
SDP, as described in Section 5.3, in terms of the Gram matrix
K of the kernel function K. ■
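
The key step of this proof, that $\Psi A \Psi^\top$ and its Gram-matrix counterpart share their non-zero eigenvalues, can be checked exactly on a tiny instance. The Python sketch below compares power sums of eigenvalues (traces of matrix powers) instead of eigenvalues themselves, and it uses the product $AK$ in place of $K^{1/2} A K^{1/2}$; the two have the same spectrum, and the substitution is made here only to avoid computing a matrix square root. The feature vectors and weights are hypothetical.

```python
from fractions import Fraction

def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def power_sums(m, count):
    """Traces of m, m^2, ..., m^count, i.e. power sums of the eigenvalues of m."""
    sums, power = [], m
    for _ in range(count):
        sums.append(sum(power[i][i] for i in range(len(m))))
        power = matmul(power, m)
    return sums

# Hypothetical explicit features with N = 3 and p0 = 2; columns are Phi(s_1), Phi(s_2).
Psi = [[Fraction(1), Fraction(0)],
       [Fraction(0), Fraction(1)],
       [Fraction(2), Fraction(1)]]
# A = diag(P(s_i) - z_i) for one labeled point (z_1 = 1) and one target point (P(s_2) = 1).
A = [[Fraction(-1), Fraction(0)],
     [Fraction(0), Fraction(1)]]

K = matmul(transpose(Psi), Psi)                 # Gram matrix, p0 x p0
M = matmul(matmul(Psi, A), transpose(Psi))      # Psi A Psi^T, N x N
AK = matmul(A, K)                               # same spectrum as K^{1/2} A K^{1/2}

sums_feature_space = power_sums(M, 3)
sums_gram_space = power_sums(AK, 3)
```

Equal power sums up to order N imply that the two matrices have the same non-zero eigenvalues with the same multiplicities, which is what allows the SDP to be written in terms of K alone.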